A Decade of Togetherness: Uncovering Sentiments and Trends in WhatsApp Group Chats¶
By Saurabh Kudesia | Aug 2025¶
© 2025 Saurabh Kudesia
This project is licensed under the MIT License. You are free to use, modify, and distribute this code, provided you include proper attribution and retain the license notice.
Image Courtesy: Unsplash.com
Background¶
In today's digital age, group chats have become dynamic reflections of social connection—capturing shared memories, humor, opinions, and emotional shifts. This project explores a WhatsApp group chat among batchmates to uncover patterns in communication, sentiment, and engagement over time.
By applying Natural Language Processing (NLP) and data visualization techniques, the analysis reveals how conversations evolve, which topics spark interaction, and how emotions ebb and flow in response to events. Ultimately, the project provides a data-driven glimpse into how a close-knit peer community communicates, bonds, and grows across a decade of digital dialogue.
Objectives¶
- Analyze Sentiment Trends: Identify emotional tones (positive, negative, neutral) and how they shift over time.
- Detect Conversation Peaks: Pinpoint days or events with unusually high activity or emotional intensity.
- Identify Active Participants: Highlight the most engaged members based on message volume and frequency.
- Discover Trending Topics: Use keyword analysis to uncover recurring themes and high-engagement discussions.
- Visualize Communication Patterns: Track how message frequency and sentiment change over days, weeks, and months.
- Understand Engagement Behavior: Analyze patterns such as peak activity hours, common media types, and link sharing habits.
Data Dictionary¶
- Data Source: Exported WhatsApp chat file (.txt) from a group of batchmates.
- Time Span: Messages exchanged from 1 Oct 2018 - 25 Jul 2025.
- Message Types: Messages (28,671), images (500), videos (109), audio (6), contacts (18), documents (10), and spreadsheets (3).
Terminology & Conventions¶
This project uses the following icons and labels to highlight key types of supplementary information:
- 💡 Data Insights: Highlights observations, best practices, or useful tips related to the data.
- 🚫 Limitations: Indicates known constraints, issues, or areas where the data or analysis may be incomplete or unreliable.
- 🔍 Analytical Insights: Explains reasoning behind a method, interpretation of results, or patterns uncovered through analysis.
Analytical Framework & Toolchain Overview¶
This project employs a multi-layered analytical framework that blends classic data science, modern Natural Language Processing (NLP), and deep learning—underscoring the complexity and depth of the methodology. Rather than relying on a single technique or toolkit, the analysis unfolds across several broad and interdependent themes:
- Data Engineering & Parsing: Raw WhatsApp exports are unstructured and noisy. Through a combination of file system operations, hashing, regex parsing, and structured data tools like pandas, the project performs intensive preprocessing to convert years of chaotic conversation into analyzable formats. This foundation supports all subsequent layers of analysis.
- Linguistic Processing & Text Normalization: Handling multilingual, emoji-laden, colloquial chat data is non-trivial. The project uses a pipeline of language detection, stemming, stopword filtering, and emoji interpretation, alongside tools like NLTK and langdetect, to normalize text while preserving semantic nuance.
- Emotional & Semantic Mapping: By combining rule-based sentiment models (SentimentIntensityAnalyzer) with vectorized text representations (TfidfVectorizer, SentenceTransformer), the project maps the emotional tone and thematic drift of group interactions over time. This adds psychological depth to the statistical backbone.
- Machine Learning & Classification: Supervised learning models are trained to classify themes, detect user roles, and track topic clusters. Clustering (e.g., KMeans) and dimensionality reduction (PCA) are used to distill insights from high-dimensional data, revealing latent behavioral trends.
- Multimodal Intelligence: Beyond text, the project integrates shared media. With state-of-the-art transformer models like BLIP and CLIP, it performs caption generation, image-text alignment, and semantic tagging, making this not just a chat analysis but a multimodal communication study.
- Narrative Visualization: Static plots, dynamic charts (plotly), word clouds, and annotated timelines collectively turn analytical output into intuitive visual narratives. This enables exploration of interactions not just numerically, but as evolving social stories.
In essence, this project fuses data engineering, NLP, machine learning, multimodal AI, and visual storytelling to decode the patterns of connection, sentiment, and behavior in long-term group chat data. The methodology reflects both computational sophistication and creative curiosity, transforming casual digital chatter into a meaningful social dataset.
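The parsing layer described above can be illustrated with a minimal sketch. It assumes one particular export layout ("DD/MM/YYYY, HH:MM - Sender: text"); real exports vary by phone locale and app version, so the pattern, the `parse_chat` helper, and the sample lines are all hypothetical and not taken from the project's own code.

```python
import re
from datetime import datetime

# Assumed line layout (varies by locale and app version):
#   "01/10/2018, 10:15 - Sender Name: message text"
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}), (?P<time>\d{2}:\d{2}) - (?P<rest>.*)$"
)


def parse_chat(lines):
    """Turn raw export lines into a list of message dicts."""
    messages = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match is None:
            # No timestamp: treat as a continuation of the previous message.
            if messages:
                messages[-1]["text"] += "\n" + line
            continue
        sender, sep, text = match["rest"].partition(": ")
        if not sep:
            # A timestamped line without "sender: " is treated as a system notice.
            sender, text = None, match["rest"]
        timestamp = datetime.strptime(
            f"{match['date']} {match['time']}", "%d/%m/%Y %H:%M"
        )
        messages.append({"timestamp": timestamp, "sender": sender, "text": text})
    return messages


# Made-up sample lines, for illustration only.
sample = [
    "01/10/2018, 10:15 - Sender_001: Happy birthday!",
    "Have a great year ahead",
    "01/10/2018, 10:20 - Sender_002 joined using this group's invite link",
    "01/10/2018, 15:05 - Sender_003: Thanks: much appreciated",
]
parsed = parse_chat(sample)
```

The resulting list of dicts can be passed directly to `pandas.DataFrame(parsed)`. One known weakness of this heuristic: a system notice that happens to contain ": " would be misread as a sender and a message.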
Data Constraints and Analytical Caveats¶
- Incomplete Historical Coverage
The dataset begins on 1 October 2018, corresponding with the author’s entry into the group. Consequently, interactions and trends between the group’s inception on 29 August 2015 and this start date are not captured in the analysis.
- Data Currency & Scope
The data is current as of 25 July 2025 and the analysis includes only those members who have been active (i.e., posted at least once) since the dataset began, and therefore does not reflect the full history or membership of the group over time.
- User Identity Tracking Limitations
WhatsApp identifies users by phone numbers, which can change over time. If a user has switched phone numbers during the analysis period, they may be counted as two distinct participants. As a result, the number of unique users identified in the dataset may not accurately match the actual number of distinct group members.
- Admin Role Uniformity
Since all members are designated as administrators, the analysis does not differentiate between admin and non-admin behavior. This limits the ability to explore leadership dynamics or role-based engagement patterns.
- Lack of Demographic Information
The analysis lacks access to demographic variables such as age, gender, location, or professional background, as this information is not available from WhatsApp. This constrains any exploration of how such factors may influence communication behavior or participation levels.
In the absence of additional user-provided details, this analysis estimates the geographic distribution of users based solely on the country codes extracted from their phone numbers. While this provides a reasonable approximation of user location, it may not reflect actual residency or physical presence.
- Multilingual Communication Complexity
The group's communication includes English, Indian languages, and code-mixed content, which poses challenges for natural language processing (NLP). Sentiment analysis, keyword extraction, and topic modeling may have reduced accuracy due to code-switching, informal phrasing, and non-standard syntax.
- Modeling Limitations
The analytical tools employed may not accurately detect sarcasm, humor, irony, or subtle bias. As a result, some emotional tones or cultural nuances may be misinterpreted or overlooked in the analysis.
- Contextual Interpretation
This project is intended as an exploratory, informal analysis for insights and engagement, not as a formal audit or evaluation. The findings are not prescriptive and should not be interpreted as critiques of individual members or group governance.
- Illustrative Use of Personal Examples
While significant effort was made to anonymize all data to protect participant privacy, certain illustrative examples may include intentional references to known group content or expressions. These are used respectfully and strictly for contextual clarity, without judgment or bias.
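The country-code estimate mentioned under "Lack of Demographic Information" could be sketched as follows. This is an illustration only: the `CALLING_CODES` table is a tiny hypothetical subset of international calling codes, the `country_from_phone` helper is not from the project, and the sketch assumes numbers are stored in international format.

```python
# Tiny illustrative subset of international calling codes (assumption);
# a real lookup would need the full table.
CALLING_CODES = {
    "1": "US/Canada",
    "44": "United Kingdom",
    "65": "Singapore",
    "91": "India",
    "971": "United Arab Emirates",
}


def country_from_phone(number):
    """Guess a country from the leading digits of an international number."""
    digits = "".join(ch for ch in number if ch.isdigit())
    # Calling codes are one to three digits long; try the longest prefix first.
    for length in (3, 2, 1):
        country = CALLING_CODES.get(digits[:length])
        if country:
            return country
    return "Unknown"
```

For example, `country_from_phone("+91 98765 43210")` resolves through the two-digit prefix. As the caveat above notes, the result says where a number was issued, not where its owner lives.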
Executive Summary¶
This analysis offers a comprehensive view of a dynamic WhatsApp group spanning 6.8 years, capturing 28,671 messages from 63 unique contributors. The findings reflect a resilient, emotionally intelligent, and professionally active digital community, offering key insights for community management, engagement optimization, and scalable communication strategy.
Key Highlights¶
Community Engagement & Participation¶
- High Activity: 28,671 messages, averaging 4,200+ messages annually.
- Broad Involvement: 105% participation rate (63 participants vs. 60 current members) reflects sustained engagement across time, including former members.
- Contributor Distribution: Top 20 users drive 60-70% of content, revealing a core-periphery structure with potential for middle and lower-tier activation.
Temporal & Behavioral Patterns¶
- Peak Activity Windows: Business hours (10–11 AM, 3–5 PM) on weekdays dominate engagement, though weekend participation remains consistent.
- Stable Over Time: No major drop-offs, with message volume aligned to group milestones and event-driven peaks.
Communication Style & Sentiment¶
- Balanced Messaging: Combination of short, informal texts and long, context-rich posts fosters both agility and depth.
- Positive Sentiment Dominance: Rich in appreciation, recognition, and celebration—reinforced by consistent emoji use (🎉, 🙏, 😊).
- Low Conflict Zone: Minimal negative sentiment and high psychological safety mark a healthy, trust-based environment.
Cultural & Linguistic Identity¶
- English-Led Multilingualism: English (85.7%) is dominant, but Hinglish and Hindi add regional nuance and relatability.
- Code-Mixing Patterns: Informal and context-driven language use reflects an Indian professional and alumni network context.
Media & File Sharing¶
- Rich Content Diversity: More than 600 shared media items (500 images, 109 videos, plus audio clips, contacts, documents, and spreadsheets) enhance communication quality and engagement.
- Professional & Social Balance: Shared media supports learning, collaboration, and celebration—reinforcing both knowledge and community bonds.
Influencers & Recognition¶
- Certain individuals (e.g., Sender_036, Sender_011) are central to discussion and recognition, playing informal leadership roles. These contributors are key to tone-setting, morale boosting, and knowledge sharing.
Challenges Identified¶
- Participation Inequality: Core group dominates activity; 33% of users show low engagement.
- Content Overload Risk: High message volume may lead to information fatigue.
- Dependence on Key Users: Heavy reliance on a few contributors for momentum and leadership.
Strategic Recommendations¶
Engagement Equity¶
- Reactivate the bottom 20% through targeted outreach.
- Empower mid-tier users with light leadership roles and recognition.
- Prevent burnout in top contributors via rotating responsibilities.
Communication Optimization¶
- Schedule key updates during peak hours.
- Guide discussion flow with prompts and structured topics.
- Maintain language inclusivity with English for clarity and Hinglish/Hindi for cultural resonance.
Culture & Sentiment¶
- Reinforce positivity through visible appreciation and shared celebration.
- Monitor emotional tone to detect early signs of disengagement or group fatigue.
Knowledge & Media Management¶
- Standardize and tag valuable shared content.
- Promote high-signal media (infographics, short videos, documents).
- Summarize key threads to improve retention and accessibility.
Analytics & Governance¶
- Introduce dashboards and KPIs (e.g., participation equality, engagement per user).
- Upgrade sentiment and language detection tools for deeper insights.
- Establish privacy-respecting feedback loops and ethical data practices.
Strategic Implications¶
- The group serves as a model for high-performing digital communities, with relevance for alumni networks, professional forums, or remote teams.
- There is strong scalability potential, provided participation is balanced, content is curated, and engagement is guided by data.
- Investing in analytics, recognition systems, and inclusive leadership will be critical for sustainable growth.
Conclusion¶
This WhatsApp group exemplifies mature digital community dynamics—where emotional intelligence, knowledge exchange, and cultural authenticity intersect. With deliberate action to address participation disparities and content organization, the group can evolve into a replicable blueprint for resilient, scalable, and inclusive online communities.
Exploratory Data Analysis¶
Group Dynamics & Participation¶
What Percentage of Participants post?¶
⚠️ Active participants exceed total declared participants.
🚫 Limitations
The higher number of active_participants reflects the full history of group activity, not just the present membership snapshot. While it may seem counterintuitive, the number of unique active participants (i.e., message senders) in a WhatsApp group can exceed the current total group members. This usually happens due to one or more of the following reasons:
- Outdated or Manually Entered Group Size: The participants count may reflect the current or recent group size, but the chat export typically includes historical messages, including those from users who have since left the group. These past participants are still counted as active senders.
- Multiple Sender IDs for the Same Person: If a participant changed their phone number or left and rejoined, WhatsApp may log them as separate users, increasing the count of unique senders.
- System or Non-member Messages: Message logs sometimes include system messages (e.g., "You added X"), temporary participants or WhatsApp Business accounts, and unknown numbers not saved in contacts. All of these can be mistakenly counted as unique senders even if they are not current members.
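The comparison behind the warning above can be reproduced with a short sketch. The `participation_rate` helper and the synthetic sender list are hypothetical; the only figures taken from this report are the 63 unique senders and the declared group size of 60.

```python
def participation_rate(senders, declared_members):
    """Return (unique sender count, rate in percent); None marks system messages."""
    unique = {s for s in senders if s is not None}
    return len(unique), 100 * len(unique) / declared_members


# Synthetic sender column: 63 pseudonymous IDs, some repeats, one system message.
senders = [f"Sender_{i:03d}" for i in range(1, 64)]
senders += ["Sender_001", "Sender_036", None]

active_participants, rate = participation_rate(senders, declared_members=60)
if active_participants > 60:
    print("Active participants exceed total declared participants.")
```

Dropping system messages before counting addresses the third cause listed above. Merging multiple IDs for the same person would need manual mapping, which this sketch does not attempt.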
Who are the most and least active participants?¶
At what times is the group most active?¶
When do the top 10 posters usually post?¶
How has group activity changed over time?¶
How has each user's activity changed year over year?¶
💡 Data Insights
- Group Composition
Out of a current group size of 60 members, there have been 63 unique active participants, reflecting a 105% participation rate. This surplus is likely due to three previously active members who have since left the group. With an average of 455 messages per active participant, the group demonstrates a high level of meaningful and sustained engagement.
- Participation Distribution
A Pareto-like concentration is evident: the top 20 users (about a third of the 63 senders) account for approximately 60-70% of all messages. This highlights a significant engagement disparity between highly active and minimally active members. The top 20 contributors serve as the core influencers, consistently driving discussions and sharing valuable insights. The bottom 20 contributors show low participation, representing an opportunity for re-engagement through targeted strategies.
- Daily & Weekly Trends
Engagement is heavily concentrated during business hours (9 AM - 6 PM), with peak activity observed in two key windows: Mid-morning (10-11 AM) and Late afternoon (3-5 PM). Activity is strongest during weekdays (Monday to Friday), reflecting professional use patterns. A sustained weekend presence suggests that members maintain a flexible, work-life integrated approach to participation.
- Long-Term Trends
Over the course of years, message volume has remained consistently strong, indicating community durability and long-term engagement. Event-driven spikes coincide with major updates, discussions, or announcements. The absence of significant drop-off points to the group's relevance and continued value over time.
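The hourly and weekday patterns summarized above come down to counting timestamps by hour and by day of week. A minimal sketch, assuming message timestamps are available as `datetime` objects; the `activity_profile` helper and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def activity_profile(timestamps):
    """Count messages per hour of day and per weekday name."""
    by_hour = Counter(ts.hour for ts in timestamps)
    by_weekday = Counter(WEEKDAYS[ts.weekday()] for ts in timestamps)
    return by_hour, by_weekday


# Made-up timestamps, not real chat data.
sample_timestamps = [
    datetime(2018, 10, 1, 10, 15),
    datetime(2018, 10, 1, 10, 40),
    datetime(2018, 10, 2, 16, 5),
    datetime(2018, 10, 6, 9, 30),
]
by_hour, by_weekday = activity_profile(sample_timestamps)
```

`by_hour.most_common(2)` then lists the busiest hours, which is one way the two peak windows could be read off.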
User Behavior Segmentation
- Top Contributors: These individuals drive the community's pulse, contributing 60-70% of overall activity. They frequently share high-value content, lead discussions, and influence group culture. Their continued participation is critical, making them a strategic priority for retention and recognition.
- Moderate Contributors: Serving as the reliable backbone of the group, moderate contributors engage regularly without dominating conversations. As growth candidates, they present ideal targets for subtle nudges to deepen participation.
- Low Contributors: Comprising roughly 33% of the group, these underutilized members exhibit minimal activity. While their churn risk may be moderate, they represent a strategic opportunity for targeted re-engagement through personalized outreach or reactivation campaigns.
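The contribution share and the three segments can be computed from per-sender message counts. The sketch below is one possible reading: the cutoffs (top 20 senders, bottom third as low contributors) follow the figures quoted in this section, but the helper names and the exact segmentation rule are assumptions, and the toy counts are invented.

```python
def top_n_share(message_counts, n):
    """Fraction of all messages written by the n most active senders."""
    total = sum(message_counts.values())
    top = sorted(message_counts.values(), reverse=True)[:n]
    return sum(top) / total


def segment_senders(message_counts, top_n=20, low_fraction=1 / 3):
    """Split senders into top, moderate, and low tiers by message count."""
    ranked = sorted(message_counts, key=message_counts.get, reverse=True)
    n_low = round(len(ranked) * low_fraction)
    # Guard against the low tier overlapping the top tier in small groups.
    low_start = max(top_n, len(ranked) - n_low)
    return {
        "top": ranked[:top_n],
        "moderate": ranked[top_n:low_start],
        "low": ranked[low_start:],
    }


# Toy counts for six hypothetical senders.
toy_counts = {"A": 50, "B": 30, "C": 10, "D": 6, "E": 3, "F": 1}
share = top_n_share(toy_counts, 2)
segments = segment_senders(toy_counts, top_n=2)
```

In the toy data the two most active senders write 80 of 100 messages, so `share` is 0.8. On the real data the same call with `n=20` would give the 60-70% range reported above.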
Language & Communication Style¶
What languages are used in the group?¶
Detecting the language of WhatsApp messages poses unique challenges due to the informal, multilingual, and often code-mixed nature of the content. Many users switch between languages mid-sentence (e.g., Hindi and English), use transliterations (e.g., "tum kya kar rahe ho?"), and include slang, abbreviations, or emojis that confuse traditional language detection models. Standard tools like langdetect or rule-based approaches struggle with such content, often misclassifying short messages or defaulting to incorrect language predictions when messages are only a few words long or contain mixed-language constructs.
To address these limitations, we use the pre-trained fastText language identification model developed by Facebook AI Research. The lid.176.ftz model, the compressed variant of lid.176, recognizes 176 languages and was trained on data from Wikipedia, Tatoeba, and SETimes. It generally handles minimal context, noisy text, and varied syntax more gracefully than traditional models, although accuracy still drops on very short or code-mixed messages. In this project, fastText is used to improve the reliability of language tagging, which is crucial for downstream analysis like sentiment detection, message categorization, or regional engagement insights.
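Before any language model sees a message, it usually helps to strip content that carries no language signal. The sketch below is an assumed preprocessing step, not the project's documented pipeline. It removes URLs and symbol characters (where most emoji live) and collapses whitespace, including newlines, since fastText's `predict` expects single-line input. The helper names are hypothetical.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")


def clean_for_langid(text):
    """Remove URLs and symbol characters, then collapse all whitespace."""
    text = URL_RE.sub(" ", text)
    # Unicode categories "So" and "Sk" cover most emoji and skin-tone modifiers.
    # Combining marks are kept, so Devanagari vowel signs stay intact.
    text = "".join(
        ch for ch in text if unicodedata.category(ch) not in {"So", "Sk"}
    )
    return " ".join(text.split())


def is_detectable(text):
    """True if at least one letter remains after cleaning."""
    return any(ch.isalpha() for ch in clean_for_langid(text))
```

Messages for which `is_detectable` returns `False` (emoji-only posts, bare links) could be labeled as undetected up front rather than passed to the model. This is a rough heuristic and will not catch every emoji sequence.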
🚫 Limitations
This analysis visualizes the distribution of languages used in WhatsApp group messages by leveraging a language detection model applied to each message. This provides a quick overview of the linguistic diversity in the group and highlights dominant languages used in conversations.
However, this method comes with notable limitations. WhatsApp messages are often short, informal, or filled with emojis, abbreviations, and spelling variations—all of which can confuse language detection models and reduce accuracy. Multilingual messages (common in informal chats) are often classified based on just a few dominant words, leading to oversimplification. Additionally, code-switching (switching languages mid-sentence) is not captured here, and the method assigns one language per message, which may not reflect the true linguistic blend.
Thus, while the chart gives a general sense of language use, it should be interpreted with caution, especially in multilingual or informal communication settings.
Which languages tend to appear together in posts?¶
Which languages are most frequently used?¶
How common is language-switching in messages?¶
🔍 Analytical Insights
To address the limitations of single-language detection in WhatsApp messages, we conduct a refined analysis to capture code-mixing and multilingual usage. In this approach, the model returns a list of candidate languages with confidence scores for each non-emoticon message. We keep every language with a confidence above 5%, which allows us to identify multiple languages used within a single message. The results offer a more nuanced view of language diversity.
Despite its improved granularity, this method has important limitations. The accuracy of language detection drops significantly for short or noisy texts—common in WhatsApp chats. It may also misclassify informal, transliterated, or regionally mixed language content (e.g., Hinglish, Taglish). Furthermore, it doesn't account for the order or structure of languages used within the message, and confidence-based thresholds may still include spurious detections.
Thus, while this approach captures multilingual tendencies better than single-label models, results should still be interpreted qualitatively alongside linguistic context.
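The 5% confidence filter can be sketched as below. It assumes a model object exposing a fastText-style `predict(text, k=...)` method that returns labels prefixed with `__label__` alongside their probabilities. The `StubModel` class and its scores are invented so the example runs without the model file; with fastText installed, the model would instead come from `fasttext.load_model` pointed at a local copy of lid.176.ftz.

```python
def languages_above_threshold(model, text, k=5, threshold=0.05):
    """Return (language, confidence) pairs whose confidence exceeds threshold."""
    labels, probs = model.predict(text, k=k)
    return [
        (label.replace("__label__", ""), float(prob))
        for label, prob in zip(labels, probs)
        if prob > threshold
    ]


class StubModel:
    """Stand-in for a loaded fastText model; returns fixed, made-up scores."""

    def predict(self, text, k=1):
        labels = ("__label__en", "__label__hi", "__label__de")
        probs = [0.62, 0.30, 0.03]
        return labels[:k], probs[:k]


detected = languages_above_threshold(StubModel(), "kal milte hain, see you")
is_code_mixed = len(detected) > 1
```

A message is flagged as code-mixed when more than one language clears the threshold. As noted above, a low threshold also admits spurious detections, so the flag is best read as a hint rather than a verdict.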
What is each user’s preferred language?¶
This section identifies the primary language used by each participant based on the content of their messages. By analyzing the language distribution at the individual level, we can infer users' language preferences, which may reflect their background, communication style, or target audience. Understanding language preference enables more inclusive group analysis and helps uncover multilingual dynamics within the group.
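One simple way to derive a preferred language is to take the most frequent detected language per sender. The sketch assumes each message already carries a single language label; the `preferred_language` helper and the toy records are illustrative only.

```python
from collections import Counter, defaultdict


def preferred_language(records):
    """Map each sender to the language label they use most often."""
    per_sender = defaultdict(Counter)
    for sender, language in records:
        per_sender[sender][language] += 1
    return {
        sender: counts.most_common(1)[0][0]
        for sender, counts in per_sender.items()
    }


# Toy (sender, detected language) pairs.
toy_records = [
    ("Sender_001", "en"), ("Sender_001", "en"), ("Sender_001", "hi"),
    ("Sender_002", "hi"), ("Sender_002", "hi"), ("Sender_002", "en"),
]
preferences = preferred_language(toy_records)
```

A majority vote hides how mixed a sender's usage is, so reporting the share of the top language alongside it may be more informative.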
How long are messages on average?¶
How does message length vary year to year?¶
Who writes long messages and who keeps it short?¶
How verbose are the top 10 users over time?¶
💡 Data Insights
- Multilingual Communication & Cultural Expression
This analysis uncovers nuanced language usage patterns within the group, offering key insights into its cultural identity, communication flexibility, and community inclusivity. These are critical factors for content strategy and group management.
Language Usage Overview
- Dominant Language: English is the primary medium, used in 85.7% of all detected messages—underscoring its role as the group’s default for clarity, professionalism, and broad accessibility.
- Undetected Content (12.9%): A significant portion of messages could not be conclusively classified due to:
- Informal abbreviations and slang
- Mixed-language or hybrid expressions (e.g., Hinglish, Taglish)
- Emoji-heavy content, links, or technical strings
This highlights the limitations of automated NLP tools in decoding casual and non-standard digital communication.
- Minority Language Presence
- Hinglish (0.4%) and Hindi (0.3%) reflect regional flavor and informal expression, enhancing relatability among Indian users.
- Multilingual/Other (0.02%) entries suggest isolated use of other languages, likely context-specific or user-driven.
- Code-Switching Behavior
The group demonstrates dynamic language switching, with members blending English and Indian languages—especially Hindi—to tailor tone, express cultural identity, or connect informally.
- Hinglish Patterns
The 95 Hinglish messages exemplify this informal, conversational code-mixing style, prevalent in Indian digital communities. It reinforces peer familiarity and social bonding, particularly in less formal exchanges.
- Linguistic Adaptability
Users show high fluency in adjusting language based on context, audience, and message intent—a sign of communication maturity and cultural sensitivity.
Communication Preferences:
- English dominates formal discussions, announcements, and informational sharing.
- Hindi/Hinglish surfaces in celebratory, emotional, or casual interactions—indicating comfort-driven expression and social cohesion.
Inclusivity & Accessibility
The multilingual environment enhances cultural authenticity while maintaining accessibility. Language diversity fosters a welcoming atmosphere, where participants can communicate in ways that feel natural and meaningful.
🚫 Analytical Limitations
Approximately 12.9% of messages remain unclassified/unknown due to:
- Non-standard syntax or spelling
- Emojis or image-only messages
- Unconventional transliterations
- Mixed-language structures beyond standard detection models
These constraints reflect the complexity of analyzing informal, multilingual digital discourse, suggesting a need for enhanced NLP models, manual validation, or context-aware review to fully capture the richness of such interactions.
Text & Word Analysis¶
How many words do messages typically contain?¶
What are the most frequently used words?¶
Which words does each user use the most?¶
Result Summary¶
| | SenderID | Plain_Top_Words | Normalized_Top_Words |
|---|---|---|---|
| 0 | Sender_001 | the, to, and, of, in, is, you, for, it, this | happy, one, https, india, birthday, good, many, day, people, please |
| 1 | Sender_002 | the, wishes, birthday, to, in, and, is, of, it, this | wishes, birthday, congratulations, mahesh, sardar, one, sunil, best, chinese, security |
| 2 | Sender_003 | the, to, in, and, of, https, com, is, file, attached | https, com, file, attached, www, img, jpg, happy, india, birthday |
| 3 | Sender_004 | the, to, and, of, in, you, is, it, for, with | happy, one, said, bday, https, time, man, day, good, many |
| 4 | Sender_005 | the, to, and, of, in, for, is, this, you, it | https, many, register, please, com, iimb, iimbaa, time, us, day |
| 5 | Sender_006 | the, you, very, to, thanks, wish, happy, and, of, birthday | thanks, wish, happy, birthday, lot, congratulations, research, india, adani, good |
| 6 | Sender_007 | the, to, and, of, is, in, for, it, be, this | https, one, com, india, day, us, people, also, good, like |
| 7 | Sender_008 | the, to, and, is, of, in, for, you, it, will | https, one, com, term, happy, birthday, www, long, know, get |
| 8 | Sender_009 | the, to, of, and, in, is, that, you, was, it | happy, birthday, india, one, us, day, many, years, https, world |
| 9 | Sender_010 | the, of, happy, and, day, in, to, many, returns, for | happy, day, many, returns, congratulations, thanks, discount, one, https, india |
| 10 | Sender_011 | the, to, you, and, of, is, in, it, birthday, happy | birthday, happy, thanks, wish, https, one, bhai, world, please, com |
| 11 | Sender_012 | to, and, the, on, in, for, of, you, happy, great | happy, great, day, years, birthday, last, modi, chinna, news, looks |
| 12 | Sender_013 | the, of, and, in, to, you, happy, is, day, many | happy, day, many, returns, wishes, best, thank, italy, chinese, congratulations |
| 13 | Sender_014 | the, to, and, of, in, for, you, your, happy, with | happy, birthday, please, us, thanks, one, https, time, share, com |
| 14 | Sender_015 | the, to, it, happy, in, and, is, of, this, birthday | happy, birthday, thanks, bde, guys, one, https, message, people, com |
| 15 | Sender_016 | the, to, and, of, in, is, for, this, it, you | birthday, happy, https, india, please, people, one, thanks, us, com |
| 16 | Sender_017 | the, of, to, in, and, is, this, with, that, for | india, us, also, karti, oxygen, company, one, iran, government, companies |
| 17 | Sender_018 | the, to, and, is, in, of, you, that, he, it | match, https, would, true, work, awesome, sure, old, please, com |
| 18 | Sender_019 | to, and, of, in, the, is, we, for, this, will | india, https, wk, happy, covid, birthday, free, please, one, www |
| 19 | Sender_020 | the, to, and, of, in, is, you, for, it, this | happy, birthday, people, day, one, https, us, would, get, time |
| 20 | Sender_021 | the, and, to, happy, birthday, in, is, of, you, it | happy, birthday, stay, blessed, congratulations, always, day, mahesh, thank, many |
| 21 | Sender_022 | the, to, in, vs, and, of, june, is, for, you | vs, june, com, india, risk, https, www, water, lemon, news |
| 22 | Sender_023 | is, the, to, and, happy, you, in, bday, it, of | happy, bday, thanks, good, please, one, https, com, get, okay |
| 23 | Sender_024 | the, to, of, and, in, is, you, that, it, this | https, time, people, sars, com, virus, one, business, take, india |
| 24 | Sender_025 | the, to, and, happy, for, birthday, thanks, is, have, in | happy, birthday, thanks, congrats, mahesh, guys, day, blessed, sunil, good |
| 25 | Sender_026 | to, is, and, of, the, with, you, she, we, from | guys, drive, may, people, even, need, many, happy, sunil, congratulations |
| 26 | Sender_027 | the, to, and, of, for, is, you, this, in, happy | happy, wish, https, register, bday, event, time, sunil, egmp, iimbaa |
| 27 | Sender_028 | happy, birthday, the, nice, and, you, to, have, best, great | happy, birthday, nice, best, great, congratulations, thanks, wishes, year, sunil |
| 28 | Sender_029 | the, and, of, to, you, happy, is, it, thanks, many | happy, thanks, many, birthday, day, dr, god, bless, hospital, sunil |
| 29 | Sender_030 | the, to, you, and, of, it, for, in, many, we | many, happy, wish, please, birthday, good, day, us, would, used |
| 30 | Sender_031 | happy, birthday, to, the, and, you, in, of, for, this | happy, birthday, congratulations, thank, mahesh, help, please, need, sridhar, best |
| 31 | Sender_032 | happy, next, have, congratulations, to, you, birthday, ketan, be, and | happy, next, congratulations, birthday, ketan, thanks, vishnu, time, bangalore, sunil |
| 32 | Sender_033 | the, to, is, and, of, in, are, be, for, https | https, youtu, people, india, market, like, us, one, get, com |
| 33 | Sender_034 | and, the, to, you, happy, of, for, is, this, in | happy, birthday, best, wishes, know, please, thank, proud, congratulations, thanks |
| 34 | Sender_035 | happy, birthday, and, for, you, congratulations, to, year, annuity, the | happy, birthday, congratulations, year, annuity, dear, one, mahesh, thanks, ahead |
| 35 | Sender_036 | the, to, and, of, is, in, for, it, you, are | wishes, birthday, mask, year, great, ahead, https, india, com, one |
| 36 | Sender_037 | happy, many, returns, and, thank, you, all, to, wish, sunil | happy, many, returns, thank, wish, sunil, great, mahesh, year, day |
| 37 | Sender_038 | you, many, thank, to, of, the, in, is, happy, returns | many, thank, happy, returns, day, city, please, picasso, wishing, congratulations |
| 38 | Sender_039 | the, and, to, in, of, is, are, https, this, for | https, com, market, one, happy, birthday, congratulations, counterargument, youtu, best |
| 39 | Sender_040 | this, to, was, com, you, old, with, https, shoes, of | com, old, https, shoes, message, deleted, app, file, attached, mars |
| 40 | Sender_041 | you, thank, and, are, this, of, to, have, it, for | thank, guys, https, www, friends, done, end, work, sorry, year |
| 41 | Sender_042 | to, the, of, very, in, this, is, sunil, you, it | sunil, sunitha, thanks, product, way, go, much, batch, india, registered |
| 42 | Sender_043 | thank, you, for, the, sunil, sridhar, saurabh, vinod, everyone, and | thank, sunil, sridhar, saurabh, vinod, everyone, making, day, special, vikal |
| 43 | Sender_044 | in, to, the, this, for, of, was, from, you, it | please, message, deleted, com, congrats, years, congratulations, looking, mail, share |
| 44 | Sender_045 | thanks, am, with, thank, you, sunil, and, raman, been, very | thanks, thank, sunil, raman, long, sure, hello, everyone, got, guys |
| 45 | Sender_046 | congrats, that, and, file, attached, this, congratulations, best, awesome, thanks | congrats, file, attached, congratulations, best, awesome, thanks, lot, friends, wishes |
| 46 | Sender_047 | the, to, all, sunil, and, one, congratulations, with, for, you | sunil, one, congratulations, thanks, room, shalini, possible, happy, group, best |
| 47 | Sender_048 | many, happy, the, returns, of, day, sunil, you, all, wish | many, happy, returns, day, sunil, wish, guru, cnn, congrats, sunita |
| 48 | Sender_049 | to, the, we, all, is, this, of, and, signal, in | signal, pl, thanks, group, https, congrats, happy, join, one, best |
| 49 | Sender_050 | to, the, in, and, is, of, for, not, this, it | may, also, one, like, think, good, us, india, please, even |
| 50 | Sender_051 | to, the, you, happy, in, of, for, is, birthday, good | happy, birthday, good, thanks, god, friends, bless, mahesh, sunil, group |
| 51 | Sender_052 | to, and, the, all, you, of, congratulations, very, happy, awesome | congratulations, happy, awesome, mmhrotd, mahesh, many, great, good, stay, congrats |
| 52 | Sender_053 | the, to, and, in, is, of, it, for, you, with | https, sea, one, happy, birthday, com, file, attached, good, img |
| 53 | Sender_054 | this, message, was, deleted | message, deleted |
| 54 | Sender_055 | is, the, and, in, to, of, for, be, with, all | help, thanks, looking, please, take, hospital, experience, also, guys, need |
| 55 | Sender_056 | happy, to, you, and, birthday, the, of, thank, all, parenting | happy, birthday, thank, parenting, course, amazing, many, returns, day, vani |
| 56 | Sender_057 | the, and, to, in, of, is, this, was, for, india | india, good, thanks, biju, indonesia, patnaik, wishes, dutch, lot, know |
| 57 | Sender_058 | thanks, lot, for, the, wishes | thanks, lot, wishes |
| 58 | Sender_059 | is, supply, chain, microsoft, the, and, for, this, in, or | supply, chain, microsoft, india, hyderabad, coe, hiring, cloud, growth, capacity |
| 59 | Sender_060 | birthday, happy, and, have, year, ahead, to, great, congratulations, the | birthday, happy, year, ahead, great, congratulations, wishes, best, nice, mahesh |
| 60 | Sender_061 | happy, birthday, have, aparna, great, year, does, anyone, know, covid | happy, birthday, aparna, great, year, anyone, know, covid, recovered, person |
| 61 | Sender_062 | with, of, you, on, to, are, the, all, thanks, this | thanks, good, dr, tainwala, based, butters, hi, connect, wow, vikas |
| 62 | Sender_063 | the, to, it, and, but, of, for, is, have, this | happy, good, mahesh, congratulations, would, one, like, ilango, even, quite |
What themes dominate the group’s conversation?¶
Navigating the Memory Lane¶
Let's identify and visualize “throwback-triggering” messages — those that evoke nostalgia or reference shared past experiences. This helps us analyze collective sentiment, uncover moments of shared memory, and understand how nostalgia emerges and evolves in group chats over time. Such insights can reveal social bonding patterns, especially around events like school reunions, college anniversaries, old trips, or festive reflections, offering a deeper view into how the group recalls and relives its collective past.
To achieve this, we employ two complementary approaches:
- Throwback Detection Using Defined Keywords
- Keyword-Agnostic Throwback Detection Using NLP (semantic similarity)
Can we identify throwback posts using keywords?¶
This approach relies on a predefined list of nostalgic or memory-related terms (such as "remember", "trip", "college", "reunion", or "old days") to identify messages that likely reference past events. The terms are compiled into a single regex pattern, which is then run over the chat dataset to select the messages that explicitly include such words or phrases.
This approach is valuable for its simplicity and high precision: when users directly mention known nostalgic triggers, the system confidently classifies those messages as throwbacks. It allows quick insight into how often and when group members recall shared memories, providing a clear, quantifiable signal for analyzing emotional engagement and social bonding over time. However, it may miss subtler, indirect references to the past, which is where NLP-based methods complement this approach.
Total throwback-triggering messages: 471
Total users who shared throwbacks: 47
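As a rough illustration of the keyword approach, the sketch below compiles a short term list into one case-insensitive regex and keeps the messages that match. The term list, the `find_throwbacks` helper, and the sample messages are assumptions made for this example; the notebook's actual keyword list is longer and is not reproduced here.

```python
import re

# Illustrative keyword list (assumption); the notebook's real list is not shown.
THROWBACK_TERMS = ["remember", "trip", "college", "reunion", "old days"]

# Word boundaries stop "trip" from matching inside "stripe"; the optional "s"
# also catches simple plurals such as "trips" or "reunions".
THROWBACK_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in THROWBACK_TERMS) + r")s?\b",
    re.IGNORECASE,
)

def find_throwbacks(messages):
    """Return the (sender, text) pairs whose text contains a throwback term."""
    return [(sender, text) for sender, text in messages if THROWBACK_RE.search(text)]

# Hypothetical messages, for illustration only.
sample = [
    ("Sender_A", "Remember our Goa trip?"),
    ("Sender_B", "Happy birthday!"),
    ("Sender_C", "Those were the old days"),
    ("Sender_B", "Nice stripe pattern"),
]
hits = find_throwbacks(sample)
print("Total throwback-triggering messages:", len(hits))
print("Total users who shared throwbacks:", len({sender for sender, _ in hits}))
```

The word-boundary anchors are the main design choice: without them, short terms like "trip" would match unrelated words and inflate the counts.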
Can NLP uncover nostalgic messages?¶
To enhance the accuracy and depth of textual analysis—particularly in areas such as social media monitoring, customer feedback interpretation, and trend analysis—the Keyword-Agnostic Throwback Detection framework leverages a machine learning (ML) approach powered by natural language processing (NLP). Unlike traditional keyword-based methods, this solution identifies references to past events by analyzing contextual cues, temporal language patterns, and semantic similarities through advanced language models like BERT or RoBERTa.
This methodology enables the detection of subtle and implicit throwback mentions that static keyword filters often miss. As a result, it provides a more comprehensive understanding of how past topics resurface in current conversations, offering richer insights into consumer behavior, shifting sentiment, and long-term brand engagement.
Total throwback-triggering messages: 25
Total users who shared throwbacks: 15
Sample Throwback Messages with Scores:
| | SenderID | Clean_Message | throwback_score | Datetime |
|---|---|---|---|---|
| 10832 | Sender_033 | we have history | 0.453242 | 2020-08-20 20:13:00 |
| 9404 | Sender_035 | nice memories | 0.661112 | 2020-06-28 10:01:00 |
| 12909 | Sender_006 | nice to see all together! brings back memories! | 0.518130 | 2020-11-22 12:01:00 |
| 9401 | Sender_029 | nice memories.. | 0.668752 | 2020-06-28 10:00:00 |
| 16969 | Sender_062 | wow! dont remember this pic. thanks for sharin... | 0.464323 | 2021-07-05 10:30:00 |
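The scoring mechanics behind a `throwback_score` of this kind can be sketched as a cosine similarity between each message and a handful of seed phrases, followed by a threshold. Everything below is an assumption for illustration: the seed phrases, the `0.45` cut-off, and in particular `toy_embed`, which is a bag-of-words stand-in so the example runs without downloading a model. A real pipeline would pass a BERT-style sentence encoder as `embed`, which is what makes the method keyword-agnostic.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in encoder: a bag-of-words count vector (NOT a BERT-style model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def throwback_scores(messages, seeds, embed=toy_embed):
    """Score each message by its best similarity to any seed phrase."""
    seed_vecs = [embed(s) for s in seeds]
    return [max(cosine(embed(m), sv) for sv in seed_vecs) for m in messages]

SEEDS = ["nice memories", "brings back memories"]  # hypothetical seed phrases
THRESHOLD = 0.45                                   # hypothetical cut-off

messages = ["nice memories..", "see you at 5 pm"]
scores = throwback_scores(messages, SEEDS)
for msg, score in zip(messages, scores):
    print(f"{score:.3f}  throwback={score >= THRESHOLD}  {msg}")
```

Because the toy encoder only overlaps on shared words, it shows how scores and thresholds interact rather than how implicit references are caught; swapping in a sentence encoder changes `embed` and nothing else.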
💡Data Insights
The textual content generated by the group offers a rich foundation for analyzing communication dynamics, sentiment, and engagement patterns. The following breakdown explores the volume, structure, and nature of messages to derive actionable insights that support community understanding and strategic decision-making.
Textual Data Volume & Quality
The group has produced a substantial volume of messages, forming a robust dataset for linguistic, thematic, and behavioral analysis. Through comprehensive preprocessing steps—including stopword removal, normalization, and text cleaning—the analysis focused on high-signal, meaningful content, increasing the accuracy and interpretability of the results.
Word Frequency & Thematic Trends
Analysis of the most frequent terms (excluding generic stopwords) highlights a strong presence of positively charged, socially supportive, and professionally oriented language. Common words such as “thanks,” “good,” “congratulations,” “happy,” and “connect” suggest a culture of gratitude and recognition, ongoing professional networking and milestone acknowledgments, and social cohesion built through consistent positive reinforcement.
Message Length & Communication Style
The group demonstrates a balanced communication pattern, blending short-form messages (for quick acknowledgments, emojis, or one-word affirmations) and longer posts (for detailed updates, reflections, and context-rich discussions). This dynamic allows for both efficiency and depth, accommodating various communication preferences and engagement levels.
User-Level Communication Patterns
Top contributors by average message length often lead extended conversations, share resources, or provide detailed input, playing central roles in shaping group dialogue. Brief communicators typically offer quick responses, indicating efficient, mobile-first interaction styles, which are still vital for maintaining group momentum. Additionally, user-specific keyword analysis reveals personalized communication styles (e.g., formal vs. casual tone), content preferences (e.g., medical topics, social events, mentorship), and emerging influencer roles, such as conversation starters and frequent responders.
Keyword & Topic Variety
The diversity in top keywords and message types points to a broad range of discussion themes, including professional updates, peer recognition and celebration, event planning and coordination, and social bonding and informal interaction. This content richness reflects a multifaceted community identity that combines value-driven professional exchange with warm interpersonal connections.
Sentiment Signals
Frequent use of emotionally positive words reinforces the presence of a supportive and encouraging group culture. This language pattern signals:
- A community grounded in mutual respect
- A tendency toward celebrating shared success
- High levels of peer appreciation and motivation
Sentiment & Emotion Detection¶
What types of sentiments are present in messages?¶
🔍 Analytical Insights
For sentiment analysis, we will use VADER (Valence Aware Dictionary and sEntiment Reasoner), which is a rule-based sentiment analysis tool specifically designed to analyze sentiments expressed in social media and short text. Unlike traditional models, VADER uses a lexicon of sentiment-related words and incorporates rules to handle punctuation, capitalization, degree modifiers (like "very"), and slang. It outputs four scores—positive, negative, neutral, and a compound score (a normalized measure of overall sentiment). VADER is lightweight, fast, and works well out-of-the-box for texts like tweets, reviews, and chat messages.
| | Message | Sentiment_Score | Sentiment_in_Message |
|---|---|---|---|
| 0 | One of the most disturbing stories that we fin... | 0.9815 | Positive |
| 2 | Is this True? | 0.4215 | Positive |
| 3 | Best opposition leader to remain in opposition 😝 | 0.6369 | Positive |
| 4 | Don’t know but really funny. | 0.6474 | Positive |
| 5 | Sunil and Ramki can clarify if it’s true | 0.4215 | Positive |
| ... | ... | ... | ... |
| 28666 | Hey guys, So how many of us are coming to IIM... | 0.0000 | Neutral |
| 28667 | <Media omitted> *Friendship meets inspiration ... | 0.9767 | Positive |
| 28668 | Dear all …a query around engg admissions in Bl... | 0.7013 | Positive |
| 28669 | <Media omitted> Kela beku neevu. 👌🙏🌹😊 | 0.0000 | Neutral |
| 28670 | <Media omitted> *Wellness Beyond Buzzwords. Wh... | 0.9001 | Positive |
24234 rows × 4 columns
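The labels in the table above are consistent with the common convention of thresholding VADER's compound score at ±0.05, as suggested in the VADER documentation. The sketch below shows that mapping only; the `label_sentiment` helper and the assumption that the notebook uses exactly these cut-offs are illustrative, and computing the compound score itself requires the VADER analyzer, which is not included here.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Compound scores taken from the table above, plus one negative example.
for score in (0.9815, 0.4215, 0.0, -0.6):
    print(f"{score:+.4f} -> {label_sentiment(score)}")
```

A narrow neutral band like this means that messages with no lexicon hits (for example, media placeholders or text in a language the lexicon does not cover) score exactly 0.0 and land in "Neutral".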
Which users post the most emotional messages?¶
What’s the emotional profile of each user?¶
How is sentiment distributed among users?¶
How many sentiments are detected per message?¶
💡Data Insights
The sentiment analysis categorized messages as positive, neutral, or negative, using both textual and emoji/emoticon cues to capture emotional nuance.
- Positive Sentiment: A substantial portion of messages reflected appreciation, encouragement, and celebration, indicating a supportive and upbeat group culture.
- Neutral Sentiment: The majority of messages were neutral, emphasizing information exchange, coordination, and professional updates, aligning with the group's purpose-driven nature.
- Negative Sentiment: Very few messages contained negative sentiment, underscoring a low-conflict environment with a strong sense of psychological safety.
- Temporal Sentiment Trends: Analysis over time uncovered notable patterns:
- Consistent Positivity: Peaks in positive sentiment aligned with achievements, celebrations, recognitions, and key group milestones.
- Event-Driven Variability: Short-term increases in mixed or negative sentiment occasionally occurred in response to external events or difficult discussions, but these were rare and typically followed by a quick return to a positive baseline.
Emotion Detection & Expression Patterns
Emoji & Emoticon Use: The group frequently used emojis and emoticons, offering insight into non-verbal emotional expression. Emotions such as happiness, excitement, and approval were prevalent, evidenced by emojis like 😊, 🎉, 🙌, and 👍. Less frequent—but meaningful—emotional nuance (e.g., expressions of empathy, surprise, or concern) often appeared in response to personal updates, group concerns, or broader events.
Text-Based Emotion Signals: Emotion detection in written content supported the emoji-based findings. Recurring use of phrases like “congratulations,” “thank you,” “well done,” and “happy to connect” reinforced a tone of positivity and mutual encouragement. Textual emotion cues suggested high levels of community warmth, trust, and camaraderie.
Key Influencers & Positivity Drivers:
A subset of members consistently shared motivational, celebratory, or appreciative messages, serving as informal leaders and morale boosters. These users help set the tone for group culture and are central to sustaining positivity and momentum.
Engagement Health Indicators:
The combination of low negative sentiment, frequent emotional validation, and broad participation in positivity indicates strong group cohesion, emotional resilience, and a culture of inclusivity and psychological safety.
Emoji Usage & Personality¶
What emojis are used most frequently?¶
🔍 Analytical Insights
Emojis with skin tone modifiers, such as 👍🏻 (Thumbs Up: Light Skin Tone) and 👍🏽 (Thumbs Up: Medium Skin Tone), are encoded as distinct Unicode sequences (the base emoji followed by a skin tone modifier code point), not just stylistic variations of the default emoji. Listing them separately in analysis is a best practice because it preserves the intentional choices users make to reflect identity, inclusivity, or cultural expression.
Aggregating these variants under the base emoji would obscure meaningful behavioral insights and distort usage patterns. By treating them as separate entries, analysts can accurately capture user intent, identify trends in personalization, and ensure a more inclusive and representative understanding of digital communication habits.
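To make the distinction concrete, the short sketch below shows that a skin-toned emoji is the base emoji followed by one of the five modifier code points U+1F3FB to U+1F3FF, so counting raw strings keeps the variants apart by default. The `base_emoji` helper is a hypothetical name added here to show the optional roll-up for comparison; it is not part of the notebook.

```python
# The five Fitzpatrick skin tone modifiers: U+1F3FB .. U+1F3FF.
SKIN_TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}

def base_emoji(emoji):
    """Strip skin tone modifiers, leaving only the base emoji."""
    return "".join(ch for ch in emoji if ch not in SKIN_TONE_MODIFIERS)

light = "\U0001F44D\U0001F3FB"   # thumbs up, light skin tone
medium = "\U0001F44D\U0001F3FD"  # thumbs up, medium skin tone

print(len(light))                               # base emoji + modifier
print(light == medium)                          # variants stay distinct
print(base_emoji(light) == base_emoji(medium))  # same base after roll-up
```

Keeping both views available lets the analysis report per-variant usage while still answering "how often is a thumbs up used at all" when needed.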
What relationships exist between emojis?¶
How do emoji sentiments change over time?¶
What can emojis tell us about message mood?¶
Which emojis are the happiest or saddest?¶
Loading emoji sentiment data from CSV...
User Personality Profiles from Emoji Use:
| | SenderID | Top_Emoji_1 | Top_Emoji_2 | Top_Emoji_3 | Personality_Type |
|---|---|---|---|---|---|
| 0 | Sender_003 | ⚘⚘⚘⚘⚘⚘ | None | None | Neutral |
| 1 | Sender_004 | ➙ | None | None | Neutral |
| 2 | Sender_005 | ☬ | None | None | Neutral |
| 3 | Sender_007 | ✓ | None | None | Positive |
| 4 | Sender_023 | ✓✓ | ✓ | ✓✓✓ | Neutral, Positive |
| 5 | Sender_056 | ⚘ | None | None | Neutral |
💡Data Insights
The group demonstrates a high frequency and diversity of emoji usage, with members regularly incorporating emojis into their messages. This indicates a digitally fluent and expressive communication culture.
- Top Emojis: The most frequently used emojis include positive and celebratory symbols (e.g., 👍, 😊, 🎉, 🙏), which align with the group’s overall positive sentiment and culture of appreciation.
- Contextual Use: Emojis are used to reinforce tone, convey non-verbal cues, and add emotional nuance to both professional and social messages. This enhances clarity and reduces the risk of misinterpretation in text-based communication.
- Expressiveness and Openness: Frequent emoji users tend to be more expressive, open, and approachable. Their messages often set a friendly and inclusive tone, encouraging broader participation and engagement.
- Positive Reinforcement: The use of emojis such as thumbs up, clapping hands, and smiley faces is strongly associated with encouragement, recognition, and support. This fosters a psychologically safe environment where members feel valued.
- Individual Communication Styles: Analysis of emoji usage by individual members reveals distinct personality traits:
- Enthusiasts: Members who use a wide variety of emojis, often in combination, are typically seen as energetic, creative, and socially active.
- Minimalists: Members who use emojis sparingly may prefer direct, concise communication, reflecting a more reserved or task-focused personality.
- Bridge Builders: Some members use emojis strategically to bridge professional and personal topics, facilitating smooth transitions and maintaining group cohesion.
Business and Community Implications
- Enhanced Engagement: The group's rich emoji culture contributes to higher engagement, as members feel more connected and understood. This is particularly valuable in remote or asynchronous professional communities.
- Cultural Sensitivity: The choice of emojis reflects cultural norms and shared values within the group, such as respect (🙏), celebration (🎉), and positivity (😊). This strengthens group identity and belonging.
- Communication Efficiency: Emojis enable quick, effective communication of emotions and reactions, reducing the need for lengthy explanations and streamlining group interactions.
Media & File Sharing¶
What types of files are shared?¶
--- File Types Found (Excluding .txt) ---
- .jpg: 487 file(s) - Images shared in the group
- .mp4: 109 file(s) - Videos shared (memes, events, recordings)
- .vcf: 18 file(s) - Contact cards shared
- .opus: 6 file(s) - Voice messages
- .webp: 13 file(s) - Stickers or compressed images
- .pdf: 7 file(s) - Documents such as brochures, notes, etc.
- .xlsx: 1 file(s) - Excel files, likely data or reports
- .csv: 2 file(s) - Comma-separated data files
- no_extension: 1 file(s) - No description available
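A tally like the one above can be produced by counting file extensions in the export folder. The sketch below is one minimal way to do it, assuming a list of filenames is already available; the `count_extensions` helper and all sample filenames except the first are made up for illustration.

```python
from collections import Counter
from pathlib import Path

def count_extensions(filenames, skip=(".txt",)):
    """Count files per lowercase extension, ignoring the extensions in `skip`."""
    counts = Counter()
    for name in filenames:
        ext = Path(name).suffix.lower() or "no_extension"
        if ext not in skip:
            counts[ext] += 1
    return counts

# Sample filenames (hypothetical apart from the first one).
files = ["IMG-20190515-WA0006.jpg", "VID-0001.mp4", "chat.txt", "IMG-2.JPG", "README"]
counts = count_extensions(files)
for ext, n in counts.most_common():
    print(f"{ext}: {n} file(s)")
```

Lowercasing the suffix avoids splitting `.jpg` and `.JPG` into separate rows, and the `no_extension` bucket keeps extension-less files visible instead of silently dropping them.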
When are files most frequently shared?¶
Are links shared regularly?¶
How large are the shared files?¶
What time of day are files most shared?¶
What kinds of images do people share, and can we figure that out automatically?¶
✅ Loading cached prediction results...
| | Media_Filename | Image_Cat_Predicted | Prediction_Confidence | Image_Cat_from_Content |
|---|---|---|---|---|
| 407 | IMG-20190515-WA0006.jpg | Event | 0.424 | Event |
| 1251 | IMG-20190708-WA0002.jpg | Announcement | 0.207 | Announcement |
| 2774 | IMG-20191009-WA0008.jpg | Meme | 0.348 | Meme |
| 3017 | IMG-20191016-WA0005.jpg | Event | 0.749 | Event |
| 3079 | IMG-20191018-WA0010.jpg | Event | 0.964 | Event |
What themes are common in shared images?¶
--- Common Themes from Images ---
Cluster 1 (78 images):
- a man in a yellow shirt and a black shirt with a face mask on
- a man with a light on his face
- a man with a mohawk mohawk and a suit

Cluster 2 (102 images):
- a tweet tweet tweet tweet tweet tweet t
- a text that reads, ` ` ' ' ' ' ' ' ' ' ' ' ' ' '
- a screenshot of a text message from person

Cluster 3 (37 images):
- a cartoon depicting the driver ' s face and the driver ' s face
- a cartoon of two people riding on a motorcycle
- a cartoon of a doctor talking to a patient

Cluster 4 (59 images):
- a man and woman sitting at a table with a cup of coffee
- a baby is being held by a woman in a hospital
- a poster with a bunch of people on it

Cluster 5 (211 images):
- happy diwali diya diya diya diya diya diya diya diya
- a quote that says i don ' t know it ' s not the worst
- mes, no one going up the ta that tree of bollywood ' s most popular villains are literally
Can we group images based on predicted captions?¶
💡Data Insights
The caption embeddings are grouped into a predefined number of clusters (e.g., 5) using the KMeans algorithm, which assigns each image to a single cluster based on the semantic similarity of its caption to others. This results in distinct image categories, where each cluster represents a dominant theme or concept derived from the captions.
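The clustering step can be illustrated with a small, dependency-free version of Lloyd's algorithm, the iteration that KMeans is built on. This is a sketch under stated assumptions: the 2-D points stand in for caption embeddings, which in practice have hundreds of dimensions, and the `kmeans` function here is a teaching implementation rather than the library routine the notebook presumably uses.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

# Toy 2-D "caption embeddings": three near the origin, two near (5, 5).
embeddings = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (5.0, 5.1), (5.1, 5.0)]
labels = kmeans(embeddings, k=2)
print(labels)
```

Because every image receives exactly one label, mixed-content images (for example, a festival greeting that is also a screenshot) are forced into a single theme, which is worth keeping in mind when reading the cluster summaries.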
The "Media & File Sharing" analysis demonstrates that the WhatsApp group has successfully leveraged multimedia content to create a rich, engaging, and informative communication environment. This approach not only enhances day-to-day interactions but also supports professional development, knowledge sharing, and community building. By continuing to encourage quality media sharing and recognizing key contributors, the group can sustain its dynamic and valuable digital community.
Media Usage Patterns and Volume
- Diverse Media Types: The group demonstrates a rich and varied media sharing culture, with members regularly sharing images, videos, documents, and other file types. This indicates a dynamic and resource-rich communication environment.
- Volume and Frequency: The analysis reveals significant media sharing activity, with members leveraging visual and multimedia content to enhance their messages and provide context. This suggests a preference for rich, engaging communication over text-only interactions.
Content Categories and Themes
- Professional Content: A substantial portion of shared media includes professional documents, presentations, and informational graphics. This reflects the group's focus on knowledge sharing, collaboration, and professional development.
- Social and Celebratory Content: Images and videos related to events, celebrations, and personal milestones are frequently shared, indicating a strong social bond and culture of recognition within the group.
- Educational and Informational: Members share educational content, news articles, and informational videos, demonstrating a commitment to continuous learning and staying informed about relevant topics.
User Behavior and Engagement
- Active Contributors: Certain members emerge as key media contributors, regularly sharing high-quality and relevant content. These individuals play a crucial role in maintaining group engagement and providing valuable resources.
- Engagement Drivers: Media-rich messages typically receive higher engagement (likes, comments, responses) compared to text-only messages, highlighting the importance of visual content in driving interaction and participation.
Strategic Business Implications
- Enhanced Communication Effectiveness: The group's media sharing culture enhances communication effectiveness by providing visual context, reducing ambiguity, and making complex information more accessible and engaging.
- Knowledge Management: The regular sharing of documents and informational content supports collective knowledge building and ensures that important information is widely accessible to all members.
- Community Building: Social and celebratory media content strengthens group bonds, fosters a sense of belonging, and creates shared memories that enhance group cohesion.
Mention & Email Analysis¶
Who mentions others most frequently?¶
What email domains are most common?¶
Demographic Patterns¶
This section explores the demographic composition of group participants based on available attributes, including age, gender, city, country, and highest education level. Analyzing these factors helps uncover patterns in group diversity, participation trends across different demographic segments, and potential correlations between user characteristics and messaging behavior. These insights provide valuable context for interpreting group dynamics and tailoring engagement strategies.
🚫 Limitations
In the absence of additional user-provided details, this analysis estimates the geographic distribution of users based solely on the country codes extracted from their phone numbers. While this provides a reasonable approximation of user location, it may not reflect actual residency or physical presence.
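The country-code estimate described above can be sketched as a prefix lookup on the digits of a number written in international format. The calling-code table below is a tiny illustrative subset, and the `country_from_phone` helper and the sample phone numbers are hypothetical; a complete table would cover every assigned code.

```python
# Tiny illustrative subset of international calling codes (assumption:
# a real analysis would use a complete table).
CALLING_CODES = {
    "1": "US/Canada (NANP)",
    "44": "United Kingdom",
    "65": "Singapore",
    "91": "India",
    "971": "United Arab Emirates",
}

def country_from_phone(number):
    """Guess a country from the calling-code prefix of an international number."""
    digits = "".join(ch for ch in number if ch.isdigit())
    for length in (3, 2, 1):  # calling codes are one to three digits long
        country = CALLING_CODES.get(digits[:length])
        if country:
            return country
    return "Unknown"

# Hypothetical numbers, for illustration only.
for phone in ("+91 98765 43210", "+1 (415) 555-0100", "+971 50 123 4567", "+999 000"):
    print(phone, "->", country_from_phone(phone))
```

The lookup says where a number was registered, not where its owner lives, which is the limitation noted above.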
Where are users geographically located?¶
💡Data Insights
Mentions per SenderID
Sender_036 and Sender_011 are the most frequently mentioned individuals in the group (15 and 14 mentions respectively), suggesting high visibility or central roles in discussions.
A long tail of users with fewer mentions indicates a typical core-periphery structure, where a few individuals are frequently acknowledged and many contribute sporadically.
Mentions may correlate with leadership or organizational roles, contributions to shared knowledge or group events, and social or professional influence within the group.
Most Frequently Shared Email Addresses
A small number of email addresses are shared repeatedly, indicating trusted points of contact or recurring professional exchange (e.g., resource sharing, onboarding, event organization). Such email addresses are most likely shared for networking or collaboration, professional inquiries or follow-ups, and event logistics or coordination.
Most Shared Email Domains
gmail.com dominates, reflecting the use of personal email accounts for communication and sharing. iimb.ac.in stands out as the most shared institutional domain, reinforcing the academic or alumni context of the group. Other domains like bridgepeople.in, tresorfit.com, zopperinsurance.com, and iimbaa.club suggest entrepreneurial activity, organizational affiliation, and ongoing professional ventures tied to group members.
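A domain ranking of this kind can be obtained by extracting email addresses with a regex and counting the part after the `@`. The sketch below is a minimal version: the pattern is a pragmatic approximation rather than a full address validator, and the `domain_counts` helper and the sample addresses are invented for illustration (only the domain names come from the results above).

```python
import re
from collections import Counter

# Pragmatic email pattern; the capture group is the domain part.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")

def domain_counts(messages):
    """Count how often each email domain appears across the messages."""
    counts = Counter()
    for text in messages:
        for domain in EMAIL_RE.findall(text):
            counts[domain.lower()] += 1
    return counts

# Hypothetical messages, for illustration only.
sample = [
    "Mail me at first.last@gmail.com",
    "Send CVs to careers@iimb.ac.in or backup@GMAIL.com",
    "No address here",
]
domains = domain_counts(sample)
print(domains.most_common())
```

Lowercasing the domain before counting merges variants such as `GMAIL.com` and `gmail.com`, which would otherwise split the top entry.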